第一節：基礎繪圖函數簡介-1(1)

同樣的資訊，使用圖像相呈現較於表格/文字，通常能讓閱讀者更快的獲得資訊。
在R裡面，我們能夠畫出任何統計圖！

– 請至這裡下載本週的範例資料

dat = read.csv("Example data.csv", header = TRUE)
head(dat)

##       eGFR Disease Survival.time Death Diabetes Cancer      SBP      DBP
## 1 34.65379       1     0.4771037     0        0      1 121.2353 121.3079
## 2 37.21183       1     3.0704424     0        1      1 122.2000 122.6283
## 3 32.60074       1     0.2607117     1        0      0 118.9136 121.7621
## 4 29.68481       1            NA    NA        0      0 118.2212 112.7043
## 5 28.35726       0     0.1681673     1        0      0 116.7469 115.7705
## 6 33.95012       1     1.2238556     0        0      0 119.9936 116.3872
##   Education Income
## 1         2      0
## 2         2      0
## 3         0      0
## 4         1      0
## 5         0      0
## 6         1      0

第一節：基礎繪圖函數簡介-1(2)

我們先從幾個簡單的統計圖開始

直方圖：需要使用函數「hist()」

hist(dat[,"eGFR"])

盒鬚圖：需要使用函數「boxplot()」

boxplot(dat[,"eGFR"])

圓餅圖：需要使用函數「pie()」以及函數「table()」

pie(table(dat[,"Education"]))

長條圖：需要使用函數「barplot()」以及函數「table()」

barplot(table(dat[,"Education"]))

第一節：基礎繪圖函數簡介-1(3)

這些圖都能透過增加不同的參數增加變化，我們可以透過函數「help()」查詢它們內部的參數。舉例來說，我們可以用下列方式改變圖的顏色

– 在R裡面的顏色可以在Colors in R裡查看

– 另外，這裡教一個新函數「par()」，他可以指定繪圖環境。其中最常見的應用為把4張圖放在同一張畫布內：

par(mfrow = c(2, 2))
hist(dat[,"eGFR"], col = "red")
boxplot(dat[,"eGFR"], col = "blue")
pie(table(dat[,"Education"]), col = c("blue", "red", "green"))
barplot(table(dat[,"Education"]), col = c("gray90", "gray50", "gray10"))

你如果喜歡你畫的圖，可以透過函數「pdf()」把圖片存出去，注意最後一定要用函數「dev.off()」關掉那個PDF檔案

pdf("plot1.pdf", height = 8, width = 8, family = "serif")
par(mfrow = c(2, 2))
hist(dat[,"eGFR"], col = "red")
boxplot(dat[,"eGFR"], col = "blue")
pie(table(dat[,"Education"]), col = c("blue", "red", "green"))
barplot(table(dat[,"Education"]), col = c("gray90", "gray50", "gray10"))
dev.off()

練習1：簡單繪圖

請透過函數「help()」查詢該如何完成下面這張圖：

練習1答案

你可能要先試著執行一次boxplot的Examples：

boxplot(count ~ spray, data = InsectSprays, col = "lightgray")

接著你可以再多嘗試幾個參數，最後畫出這樣的圖：

boxplot(dat[,"eGFR"] ~ dat[,"Disease"], col = c("blue", "red"), ylab = "eGFR", xlab = "Disease", main = "eGFR value by Disease status", lwd = 1.5)

第二節：基礎繪圖函數簡介-2(1)

接著我們介紹一個強大的函數「plot()」，他支援了多種不同的圖形，其中最主要的是散布圖：

plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP")

其實，我們可以修改點的造型，例如：

plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", pch = 19)

下面有pch造型與數字的對應表：

第二節：基礎繪圖函數簡介-2(2)

你可以為你的圖形加點東西，首先我們先介紹函數「lines()」。

– 函數「lines()」的效果是按照順序把幾個點連起來，舉例來說…

– 註：函數「plot.new()」及函數「plot.window()」是拿來開一張新畫布用的！

x = c(1, 4, 7)
y = c(2, 9, 6)
plot.new()
plot.window(xlim = c(0, 10), ylim = c(0, 10))
lines(x, y)

當然，如果點夠密，你其實可以畫出圓！

z = 0:1000/100
x = sin(z) #三角函數sin
y = cos(z) #三角函數cos
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
lines(x, y)

第二節：基礎繪圖函數簡介-2(3)

學會函數「lines()」以後，我們能夠幫散布圖上加預測線了…

– 預測線的方程式，需要先用第7課所學到的函數「glm()」幫忙建立，你看得懂下面的程式碼嗎？

# 建立MODEL以及預測線的座標
X = dat[,"SBP"]
Y = dat[,"DBP"]
model = glm(Y~X)
COEF = model$coefficients
x = c(0, 200)
y = COEF[1] + COEF[2] * x

plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", pch = 19)
lines(x, y, col = "red", lwd = 2)

第二節：基礎繪圖函數簡介-2(4)

接著，我們想要建立95% CI的預測線，這比較困難，有兩種方法。

第一種方法是將截距依序平移從0到200，做法是在建立預測式時將SBP依序從從0減到200。

x = 0:1000/5
Y = dat[,"DBP"]
y = x
ci.low = x
ci.up = x

indexes = 1:length(x)

for (i in indexes) {
  X = dat[,"SBP"] - x[i]
  model = glm(Y~X)
  COEF = summary(model)$coefficients
  y[i] = COEF[1,1]
  ci.low[i] = COEF[1,1] - qnorm(0.975) * COEF[1,2]
  ci.up[i] = COEF[1,1] + qnorm(0.975) * COEF[1,2]
}
plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", pch = 19)
lines(x, y, col = "red", lwd = 2)
lines(x, ci.low, col = "red", lty = 2, lwd = 2)
lines(x, ci.up, col = "red", lty = 2, lwd = 2)

第二種作法，你需要一些線性回歸的相關知識，你知道變異數的期望值是透過矩陣乘法獲得的…

x = 0:1000/5
X = dat[,"SBP"]
Y = dat[,"DBP"]
model = glm(Y~X)
COEF = model$coefficients
VCOV = summary(model)$cov.scaled

y.list = COEF[1] + COEF[2] * x
se.list = sqrt(diag(cbind(1, x) %*% VCOV %*% t(cbind(1, x))))
ci.low.list = y.list - qnorm(0.975) * se.list
ci.up.list = y.list + qnorm(0.975) * se.list


plot(dat[,"SBP"], dat[,"DBP"], ylab = "DBP", xlab = "SBP", main = "Scatter plot of SBP and DBP", pch = 19)
lines(x, y.list, col = "red", lwd = 2)
lines(x, ci.low.list, col = "red", lty = 2, lwd = 2)
lines(x, ci.up.list, col = "red", lty = 2, lwd = 2)

第二節：基礎繪圖函數簡介-2(5)

其實，你還可以為你的圖形加點料…

函數「text()」可以為你的圖片上加文字描述

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
t = c("A", "B", "C", "D")
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
text(x, y, t)

函數「points()」可以為你的圖片上加點

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
points(x, y, pch = 1:4)

函數「legend()」可以為你的圖片加上註釋

plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
legend("topleft", c("Female", "Male"), col = c("red", "blue"), pch = c(15, 19), bg = "gray90")
legend(0, 0, c("estimates", "95% CI"), lty = c(1, 2), lwd = 2, col = "black")

函數「polygon()」可以畫多邊形

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
polygon(x, y, col = "green")

練習2：手刻一張圖

你已經會畫長條圖了吧? 在之前的課程5中也會算平均值&標準差(標準誤)對吧? 而且也知道該怎樣求出t test的結果對吧。

– 畫出下面這張圖！

##   Variable Disease:0   Disease:1   p-value
## 1 "eGFR"   "33.2±6.4" "34.4±7.2" "0.020"

練習2答案

這題的關鍵在於不斷的嘗試，最終你可以獲得下列程式碼：

m0 = 33.2
m1 = 34.4

s0 = 6.4 / sqrt(249)
s1 = 7.2 / sqrt(691)

XXX = c(m0, m1)
barplot(XXX, col = c("gray50", "white"), xlab = "Disease", ylab = "eGFR", ylim = c(0, 43))

lines(c(1.9, 1.9), c(m1 - qnorm(0.975) * s1, m1 + qnorm(0.975) * s1), lwd = 3)
lines(c(1.75, 2.05), c(m1 + qnorm(0.975) * s1, m1 + qnorm(0.975) * s1), lwd = 3)
lines(c(1.75, 2.05), c(m1 - qnorm(0.975) * s1, m1 - qnorm(0.975) * s1), lwd = 3)
lines(c(0.7, 0.7), c(m0 - qnorm(0.975) * s0, m0 + qnorm(0.975) * s0), lwd = 3)
lines(c(0.55, 0.85), c(m0 + qnorm(0.975) * s0, m0 + qnorm(0.975) * s0), lwd = 3)
lines(c(0.55, 0.85), c(m0 - qnorm(0.975) * s0, m0 - qnorm(0.975) * s0), lwd = 3)
lines(c(0.7, 0.7, 1.9, 1.9), c(36, 38, 38, 36), lwd = 3)
text(1.3, 40, "p = 0.020")
legend("topright", c("Control", "Case"), fill = c("gray50", "white"))

第三節：存活曲線(1)

基本的繪圖函數學會以後，我們現在來多學幾個特殊的統計圖型。

– 首先我們先來學存活分析最常使用的：Kaplan Meier curve

我們也可以在之前學到的套件『survival』中找到他的存在。

– 先載入這個套件吧！

library(survival)

函數「survfit()」可以用來構建存活分析需要的資訊，而函數「plot()」可以利用這些資訊直接畫出Kaplan Meier curve

– 函數「survdiff()」可以拿來計算log rank test的結果

KM.curve = survfit(Surv(dat[,"Survival.time"], dat[,"Death"])~ dat[,"Diabetes"])
KM.curve

## Call: survfit(formula = Surv(dat[, "Survival.time"], dat[, "Death"]) ~ 
##     dat[, "Diabetes"])
## 
##    79 observations deleted due to missingness 
##                       n events median 0.95LCL 0.95UCL
## dat[, "Diabetes"]=0 749    379   1.92    1.76    2.23
## dat[, "Diabetes"]=1 172     89   2.11    1.78    2.61

log_rank.test = survdiff(Surv(dat[,"Survival.time"], dat[,"Death"])~ dat[,"Diabetes"])
log_rank.test

## Call:
## survdiff(formula = Surv(dat[, "Survival.time"], dat[, "Death"]) ~ 
##     dat[, "Diabetes"])
## 
## n=921, 79 observations deleted due to missingness.
## 
##                       N Observed Expected (O-E)^2/E (O-E)^2/V
## dat[, "Diabetes"]=0 749      379    370.3     0.205     0.986
## dat[, "Diabetes"]=1 172       89     97.7     0.777     0.986
## 
##  Chisq= 1  on 1 degrees of freedom, p= 0.3

plot(KM.curve, col=c("red", "blue"), xlab="Time(month)", ylab="Free of death" ,xlim=c(0,40), main="Overall survival", mark.time = FALSE, lwd = 3)
legend("topright", c("Diabetes patient", "Healthy control"), col = c("blue", "red"), text.col = c("blue", "red"), lty = 1, bg = "gray90", lwd = 3) 
text(20,0.8, "p of log rank test = 0.321")

第三節：存活曲線(2)

在畫存活曲線的時候，我們可能希望更改座標軸的呈現方式，在R裡面支持我們自己定義座標軸的樣式，我們可以利用參數『axes = FALSE』，命令R不顯示座標軸，再利用函數「axis()」定義座標軸樣式

KM.curve = survfit(Surv(dat[,"Survival.time"], dat[,"Death"])~ dat[,"Diabetes"])

plot(KM.curve, col=c("red", "blue"), xlab="Time", ylab="Free of death" ,xlim=c(0,40), main="Overall survival", mark.time = FALSE, axes = FALSE, lwd = 3)
axis(1, at = c(0, 10, 20, 30, 40), labels = c("2010-Jan", "2010-Nov", "2011-Sep", "2012-Jul", "2013-May"), pos = 0)
axis(2, at = c(0, 0.2, 0.4, 0.6, 0.8, 1), labels = c("0%", "20%", "40%", "60%", "80%", "100%"), pos = 0)

legend("topright", c("Diabetes patient", "Healthy control"), col = c("blue", "red"), text.col = c("blue", "red"), lty = 1, bg = "gray90", lwd = 3) 
text(20,0.8, "p of log rank test = 0.321")

第三節：存活曲線(3)

使用Cox比例風險模型時也可以畫Kaplan Meier curve，這樣可以量化該因子的重要性。

– 唯一的問題是當模型中有很多變項時，我們會假定其他變項為一個定值(通常為平均數)，接著再比較有興趣的變項對存活情形的影響。

dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
Cox.model = coxph(Surv(dat[,"Survival.time"], dat[,"Death"]) ~ ., data = dat[,c("Diabetes", "SBP", "Income")])

NEW = data.frame(Diabetes = factor(0, levels = c(0, 1)),
                 SBP = mean(dat[,"SBP"], na.rm = TRUE),
                 Income = factor(0:2, levels = c(0, 1, 2)))

Predict.Surv = survfit(Cox.model, newdata = NEW)

plot(Predict.Surv, col = c("red", "darkgreen", "blue"), xlab="Time", ylab="Free of death" ,xlim=c(0, 40), main="Overall survival", mark.time = FALSE, lwd = 3)
legend("topright", c("Low income", "The middle class", "Wealthy"), col = c("red", "darkgreen", "blue"), text.col = c("red", "darkgreen", "blue"), lty = 1, bg = "gray90", lwd = 3) 
text(20,0.7, "p of Cox moel (wald test) = 0.198")

練習3：手刻KM curve

雖然你已經可以用函數「plot()」直接將Kaplan Meier curve畫出來，但是為了將來的應用，我現在要強迫各位利用下面的資訊親手畫出Kaplan Meier curve

– 物件『RATES』以及物件『TIMES』分別是時間與存活機率的對應座標

dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
Cox.model = coxph(Surv(dat[,"Survival.time"], dat[,"Death"]) ~ ., data = dat[,c("Diabetes", "SBP", "Income")])

NEW = data.frame(Diabetes = factor(0, levels = c(0, 1)),
                 SBP = mean(dat[,"SBP"], na.rm = TRUE),
                 Income = factor(0:2, levels = c(0, 1, 2)))

Predict.Surv = survfit(Cox.model, newdata = NEW)

RATES = Predict.Surv$surv
head(RATES)

##              1         2         3
## [1,] 1.0000000 1.0000000 1.0000000
## [2,] 0.9988564 0.9991142 0.9987262
## [3,] 0.9977127 0.9982282 0.9974526
## [4,] 0.9977127 0.9982282 0.9974526
## [5,] 0.9977127 0.9982282 0.9974526
## [6,] 0.9977127 0.9982282 0.9974526

TIMES = Predict.Surv$time
head(TIMES)

## [1] 0.0008229762 0.0023436942 0.0061026923 0.0067782956 0.0071076085
## [6] 0.0081918007

請你依照下列程序畫出Kaplan Meier curve

開一個空畫布
利用函數「lines()」畫出3條存活曲線(注意：由於Kaplan Meier curve在有人死的時候同個時間點會有兩個點，請你自己增加這些點)
自行繪製座標軸

練習3答案

這題的關鍵在於對KM curve的了解：

current_TIMES = c(0, TIMES)
time_seq = as.numeric(rbind(current_TIMES[-length(current_TIMES)], current_TIMES[-1]))

plot.new()
plot.window(xlim = c(0, 40), ylim = c(0, 1))

for (i in 1:ncol(RATES)) {
  
  current_RATES = c(1, RATES[,i])
  rate_seq = as.numeric(rbind(current_RATES[-length(current_RATES)], current_RATES[-1]))
  lines(time_seq, rate_seq, col = i + 1)
  
}

axis(1, at = c(0, 10, 20, 30, 40), labels = c(0, 10, 20, 30, 40), pos = 0)
axis(2, at = c(0, 0.2, 0.4, 0.6, 0.8, 1), labels = c("0%", "20%", "40%", "60%", "80%", "100%"), pos = 0)

資料視覺化1